FM-KZ: An even simpler alphabet-independent FM-index
Abstract
In an earlier work [6] we presented a simple FM-index variant based on the idea of Huffman-compressing the text and then applying the Burrows-Wheeler transform to it. The main drawback of Huffman coding was its lack of synchronizing properties, which forced us to supply a second bit stream marking the Huffman codeword boundaries. As a result, the index needed O(n(H0+1)) bits of space, but with a constant factor of 2 on the main term. Several options exist for mitigating this space overhead, with varying effects on query speed. In this work we propose Kautz-Zeckendorf coding as a simple and practical replacement for Huffman coding. We dub the new index FM-KZ. We also present an efficient implementation of the rank operation, which is the main building block of FM-KZ. Experimental results show that our index provides an attractive space/time tradeoff compared with existing succinct data structures, and in the DNA test it wins in both search time and space use. An additional asset of our solution is its relative simplicity.
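To illustrate why a Zeckendorf-style code needs no separate boundary bit stream, here is a minimal sketch of the classic Fibonacci code. It is an assumption-laden stand-in: the abstract does not spell out the exact Kautz-Zeckendorf variant used in FM-KZ, which may differ in detail. The function names `fib_encode` and `fib_decode_stream` are hypothetical. The key property is that a Zeckendorf representation never contains two adjacent 1 bits, so an appended 1 produces the pair "11" only at the end of a codeword.

```python
from typing import List


def fib_encode(n: int) -> str:
    """Classic Fibonacci code of n >= 1 (illustrative, not the paper's exact code).

    Writes the Zeckendorf digits of n, least significant first, over the
    Fibonacci numbers 1, 2, 3, 5, 8, ... and appends a closing '1'.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    fibs.pop()  # drop the first Fibonacci number that exceeds n
    bits = ["0"] * len(fibs)
    remaining = n
    # Greedy choice from the largest Fibonacci number yields the
    # Zeckendorf representation (no two adjacent 1 bits).
    for i in range(len(fibs) - 1, -1, -1):
        if fibs[i] <= remaining:
            bits[i] = "1"
            remaining -= fibs[i]
    return "".join(bits) + "1"  # the closing '1' creates the "11" terminator


def fib_decode_stream(bits: str) -> List[int]:
    """Split a concatenation of Fibonacci codewords using only the '11' marker."""
    values = []
    total, fib_cur, fib_next, prev = 0, 1, 2, "0"
    for b in bits:
        if b == "1" and prev == "1":
            # Second '1' of a pair: codeword boundary, no side information needed.
            values.append(total)
            total, fib_cur, fib_next, prev = 0, 1, 2, "0"
            continue
        if b == "1":
            total += fib_cur
        fib_cur, fib_next = fib_next, fib_cur + fib_next
        prev = b
    return values


if __name__ == "__main__":
    stream = "".join(fib_encode(v) for v in [4, 1, 2])
    print(stream)
    print(fib_decode_stream(stream))
```

In an index of this kind one would presumably assign the shortest codewords to the most frequent symbols, as with Huffman coding; that mapping is omitted here.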
Similar resources
First Huffman, Then Burrows-Wheeler: A Simple Alphabet-Independent FM-Index
Main Results. The basic string matching problem is to determine the occurrences of a short pattern P = p1p2...pm in a large text T = t1t2...tn, over an alphabet of size σ. Indexes are structures built on the text to speed up searches, but they have traditionally taken up much space. In recent years, succinct text indexes have appeared. A prominent example is the FM-index [2], which takes little spa...
A simple alphabet-independent FM-index
We design a succinct full-text index based on the idea of Huffman-compressing the text and then applying the Burrows-Wheeler transform over it. The resulting structure can be searched as an FM-index, with the benefit of removing the sharp dependence on the alphabet size, σ, present in that structure. On a text of length n with zero-order entropy H0, our index needs O(n(H0 + 1)) bits of space, wi...
List of Contributions The Pre-history and Future of the Block-Sorting Compression Algorithm
The FM-index is a succinct text index needing only O(Hkn) bits of space, where n is the text size and Hk is the kth order entropy of the text. FM-index assumes constant alphabet; it uses exponential space in the alphabet size, σ. In this paper we show how the same ideas can be used to obtain an index needing O(Hkn) bits of space, with the constant factor depending only logarithmically on σ. Our...
An Efficient Composite-Alphabet Transform for String Matching under a Restricted Alphabet Set
String matching is the problem of finding all occurrences of a short pattern in a relatively long reference string. While a number of methods have been presented, most published implementations impose several restrictions for practical reasons. We focus on the restriction of the alphabet size, which is usually set to 256 in many string matching libraries. When strings must be handled ov...
An Alphabet-Friendly FM-Index
We show that, by combining an existing compression boosting technique with the wavelet tree data structure, we are able to design a variant of the FM-index which scales well with the size of the input alphabet Σ. The size of the new index built on a string T[1, n] is bounded by nHk(T) + O((n log log n) / log_{|Σ|} n) bits, where Hk(T) is the k-th order empirical entropy of T. The above bound h...